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Abstract 



This paper presents a diffusion based probabilistic interpretation of 
spectral clustering and dimensionality reduction algorithms that use the 
eigenvectors of the normalized graph Laplacian. Given the pairwise adja- 
cency matrix of all points, we define a diffusion distance between any two 
data points and show that the low dimensional representation of the data 
f^*) | by the first few eigenvectors of the corresponding Markov matrix is opti- 

Q\ . mal under a certain mean squared error criterion. Furthermore, assuming 

that data points are random samples from a density p(x) = e^ u<yX ^ we 
identify these eigenvectors as discrete approximations of eigenfunctions 
of a Fokker-Planck operator in a potential 2U(x) with reflecting bound- 
ary conditions. Finally, applying known results regarding the eigenvalues 
and eigenfunctions of the continuous Fokker-Planck operator, we provide 
' a mathematical justification for the success of spectral clustering and di- 

mensional reduction algorithms based on these first few eigenvectors. 
This analysis elucidates, in terms of the characteristics of diffusion pro- 
cesses, many empirical findings regarding spectral clustering algorithms. 
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1 Introduction 



Clustering and low dimensional representation of high dimensional data are important 
problems in many diverse fields. In recent years various spectral methods to perform these 
tasks, based on the eigenvectors of adjacency matrices of graphs on the data have been 
developed, see for example 1 1 1-[9 1 and references therein. In the simplest version, known 
as the normalized graph Laplacian, given n data points {a?i}" =1 where each Xj 6 R p , we 
define a pairwise similarity matrix between points, for example using a Gaussian kernel 
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with width e, 

Lij = k(xi,Xj) = cxp (- — ^ (1) 

and a diagonal normalization matrix Dj^ = J^j Li.j- Many works propose to use the 
first few eigenvectors of the normalized eigenvalue problem L<j> = \D<f>, or equivalently 
of the matrix M = D^ 1 L, either as a low dimensional representation of data or as good 
coordinates for clustering purposes. Although eq. (1) is based on a Gaussian kernel, other 
kernels are possible. While for actual datasets the choice of a kernel k(xi, Xj) is crucial, it 
does not qualitatively change our asymptotic analysis 1101 . 

The use of the first few eigenvectors of M as good coordinates is typically justified with 
heuristic arguments or as a relaxation of a discrete clustering problem Q. In (HE) Belkin 
and Niyogi showed that when data is uniformly sampled from a low dimensional manifold 
of W the first few eigenvectors of M are discrete approximations of the eigenfunctions of 
the Laplace-Beltrami operator on the manifold, thus providing a mathematical justification 
for their use in this case. A different theoretical analysis of the eigenvectors of the matrix 
M, based on the fact that M is a stochastic matrix and thus represents a random walk on 
the graph has been described by Meila and Shi II II . who considered the case of piecewise 
constant eigenvectors for specific lumpable matrix structures. Another notable work that 
considered the random walk aspects of spectral clustering is 1171 1121 . where the authors 
suggest clustering based on the average commute time between points. 

In this paper we provide a unified probabilistic framework which combines these results 
and extends them in two different directions. First, in section[2]we define a distance func- 
tion between any two points based on the random walk on the graph, which we naturally 
denote the diffusion distance. We then show that the low dimensional description of the 
data by the first few eigenvectors, denoted as the diffusion map, is optimal under a mean 
squared error criterion based on this distance. In section|3]we consider a statistical model, 
in which data points are iid random samples from a probability density p(x) in a smooth 
bounded domain £1 C W and analyze the asymptotics of the eigenvectors as the number of 
data points tends to infinity. This analysis shows that the eigenvectors of the finite matrix 
M are discrete approximations of the eigenfunctions of a Fokker-Planck (FP) operator with 
reflecting boundary conditions. This observation, coupled with known results regarding the 
eigenvalues and eigenfunctions of the FP operator provide new insights into the properties 
of these eigenvectors and on the performance of spectral clustering algorithms, as described 
in section [4] 



2 Diffusion Distances and Diffusion Maps 

The starting point of our analysis, as also noted in other works, is the observation that the 
matrix M is adjoint to a symmetric matrix 

M s = D 1/2 MD- 1/2 (2) 

Thus, M and M s share the same eigenvalues. Moreover, since M s is symmetric it is 
diagonalizable and has a set of n eigenvalues {Aj}™^ 1 whose corresponding eigenvectors 
{vj} form an orthonormal basis of M. n . The left and right eigenvectors of M, denoted <pj 
and ipj are related to those of M s according to 

(j>j = vjD 1 / 2 , ipj = VjD- 1 ' 2 (3) 

Since the eigenvectors Vj are orthonormal under the standard dot product in R", it follows 
that the vectors <f>j and ipk are bi-orthonormal 



{<pi,ipj) = Si j 



(4) 



where (it, v) is the standard dot product between two vectors in R™. We now utilize the 
fact that by construction M is a stochastic matrix with all row sums equal to one, and can 
thus be interpreted as defining a random walk on the graph. Under this view, Mij denotes 
the transition probability from the point Xi to the point Xj in one time step. Furthermore, 
based on the similarity of the Gaussian kernel ([Q to the fundamental solution of the heat 
equation, we define our time step as At = e. Therefore, 

Pr{x(t + e) = Xj | x(t) = x t } = M itj (5) 

Note that e has therefore a dual interpretation in this framework. The first is that e is the 
(squared) radius of the neighborhood used to infer local geometric and density information 
for the construction of the adjacency matrix, while the second is that e is the discrete time 
step at which the random walk jumps from point to point. 

We denote by p(t, y\x) the probability distribution of a random walk landing at location 
y at time t, given a starting location x at time t = 0. For t = ke, p(t,y\xi) = eiM k , 
where is a row vector of zeros with a single one at the z-th coordinate. For e large 
enough, all points in the graph are connected so that M has a unique eigenvalue equal 
to 1. The other eigenvalues form a non-increasing sequence of non-negative numbers: 
Ao = 1 > Ai > A2 > . . . > A„_i > 0. Then, regardless of the initial starting point x, 

lim p(t,y\x) = <f> (y) (6) 

t — >oc 

where c/)q is the left eigenvalue of M with eigenvalue Ao = 1, explicitly given by 

M x i) = v^TT— ( ? ) 
j 

This eigenvector also has a dual interpretation. The first is that 0o is the stationary prob- 
ability distribution on the graph, while the second is that 4>o(x) is the density estimate at 
the point x. Note that for a general shift invariant kernel K(x — y) and for the Gaussian 
kernel in particular, <fio is simply the well known Parzen window density estimator. 

For any finite time t, we decompose the probability distribution in the eigenbasis {<f>j} 

p{t,y\x) = Mv) + ^a 3 {x)\]4> ] {y) (8) 

where the coefficients a,j depend on the initial location x. Using the bi-orthonormality 
condition © gives aj(x) — ipj(x), with ao(x) — ipo{&) = 1 already implicit in 0. 

Given the definition of the random walk on the graph it is only natural to quantify the 
similarity between any two points based on the evolution of their probability distributions. 
Specifically, we consider the following distance measure at time t, 

D^(xo,x t ) = \\p(t,y\x ) -p(i,y|a;i)||^ (9) 

= ^2(p(t, y\x ) - p(t, y|a;i)) 2 w(y) 

y 

with the specific choice w(y) — 1 /<fio(y) for the weight function, which takes into account 
the (empirical) local density of the points. 

Since this distance depends on the random walk on the graph, we quite naturally denote it 
as the diffusion distance at time t. We also denote the mapping between the original space 
and the first k eigenvectors as the diffusion map 

^(x) = (A^i(x) ) A|V 2 (x),...,A t fe ^(x)) (10) 

The following theorem relates the diffusion distance and the diffusion map. 



Theorem: The diffusion distance l|9} is equal to Euclidean distance in the diffusion map 
space with all (n — 1) eigenvectors. 

£> t 2 (x 0> xi) - Y X f ^ x °) - V^'^i)) 2 = ll*t(*o) - *t(*i)|| 2 (ID 
Proof: Combining and l|9} gives 

D 2 t {x , Xl ) = Y (Y x i(M*a) ~ ^(^i))^(y)) 2 i/0o(y) (12) 
y j 

Expanding the brackets, exchanging the order of summation and using relations Q and @ 
between (f>j and ipj yields the required result. Note that the weight factor 1 /</>o is essential 
for the theorem to hold. □. 

This theorem provides a justification for using Euclidean distance in the diffusion map 
space for spectral clustering purposes. Therefore, geometry in diffusion space is meaning- 
ful and can be interpreted in terms of the Markov chain. In particular, as shown in 1 15], 
quantizing this diffusion space is equivalent to lumping the random walk. Moreover, since 
in many practical applications the spectrum of the matrix M has a spectral gap with only 
a few eigenvalues close to one and all additional eigenvalues much smaller than one, the 
diffusion distance at a large enough time t can be well approximated by only the first few 
fc eigenvectors ipi(x) , . . . , tpk{x), with a negligible error of the order of 0((Afe+i/Afc)*). 
This observation provides a theoretical justification for dimensional reduction with these 
eigenvectors. In addition, the following theorem shows that this fc-dimensional approxima- 
tion is optimal under a certain mean squared error criterion. 

Theorem: Out of all fc-dimensional approximations of the form 

k 

p(t,y\x) = <j) (y) + Y a j( x i t ) w j(y) 

for the probability distribution at time t, the one that minimizes the mean squared error 

E x {\\p(t,y\x)-p(t,y\x)\\l} 

where averaging over initial points x is with respect to the stationary density 4>o(x), is 
given by Wj(y) — <f>j(y) and aj(t,x) — X t jipj(x). Therefore, the optimal fc-dimensional 
approximation is given by the truncated sum 

k 

p{y,t\x)=<t> Q {y) + Y X )^ x ) < l ) M < 13 > 
i=i 

Proof: The proof is a consequence of a weighted principal component analysis applied to 
the matrix M, taking into account the biorthogonality of the left and right eigenvectors. 

We note that the first few eigenvectors are also optimal under other criteria, for example for 
data sampled from a manifold as in |4|, or for multiclass spectral clustering 1 151 . 

3 The Asymptotics of the Diffusion Map 

The analysis of the previous section provides a mathematical explanation for the success of 
the diffusion maps for dimensionality reduction and spectral clustering. However, it does 
not provide any information regarding the structure of the computed eigenvectors. 

To this end, we introduce a statistical model and assume that the data points {a^} are i.i.d. 
random samples from a probability density p(x) confined to a compact connected subset 



£1 C M. p with smooth boundary dfl. Following the statistical physics notation, we write the 
density in Boltzmann form, p(x) = er u ^ x \ where U(x) is the (dimensionless) potential 
or energy of the configuration x. 

As shown in 1 10 1, in the limit n — > oo the random walk on the discrete graph converges 
to a random walk on the continuous space fl Then, it is possible to define forward and 
backward operators Tf and Tb as follows, 

T f [4>](x) = f n M(x\y)cf>(y)p(y)dy 

(14) 

TbW\ O) - f n M(y\x)i>(y)p(y)dy 

where M(x\y) = cxp(— \\x — y\\ 2 /2e) / D(y) is the transition probability from y to x in 
time e, and D(y) = J cxp(— \\x — y\\ 2 /2e)p(x)dx. 

The two operators Tf and Tb have probabilistic interpretations. If 4>(x) is a probability 
distribution on the graph at time t = 0, then Tf [<fi] is the probability distribution at time 
t = e. Similarly, Tb[tp](x) is the mean of the function i/j at time t = e, for a random walk 
that started at location x at time t = 0. The operators Tf and Tb are thus the continuous 
analogues of the left and right multiplication by the finite matrix M. In 1141 a rigorous 
mathematical proof of the convergence of the eigenvalues and eigenvectors of the discrete 
matrix M to the eigenvalues and eigenfunctions of the integral operators Tf and Tj, is given. 

We now take this analysis one step further and consider the limit e — > 0. This is possible, 
since when n = oo each data point contains an infinite number of nearby neighbors. In 
this limit, since e also has the interpretation of a time step, the random walk converges to a 
diffusion process, whose probability density evolves continuously in time, according to 

dpi^ = p( x ,t + e)-p{x,t) = 

in which case it is customary to study the infinitesimal generators (propagators) 

Hf = lim Tf - 1 , li b = lim ^— - (16) 

e^Q e e^Q £ 

Clearly, the eigenfunctions of Tf and Tb converge to those of Hf and Hb, respectively. 
As shown in 1 10 1, the backward generator is given by the following Fokker-Planck operator 

H b ^ = A^-2V4>-VU (17) 
which corresponds to a diffusion process in a potential field of 2U (x) 

x(t) = -V(2Z7) + V2Dw(t) (18) 

where w(t) is standard Brownian motion in p dimensions and D is the diffusion coefficient, 
equal to one in equation (I17> . The Langevin equation (II 8b is the most common equation to 
describe stochastic dynamical systems in physics, chemistry and biology 1 161 1171 . As such, 
its characteristics as well as those of the corresponding FP equation have been extensively 
studied, see [ 16 1-| 19 1 and many others. The term • VU in dl7> is interpreted as a drift 
term towards low energy (high-density) regions, and as discussed in the next section, may 
play a crucial part in the definition of clusters. 

Note that when data is uniformly sampled from O, VU = so the drift term vanishes and 
we recover the Laplace-Beltrami operator on ft. The convergence (in probability) of the 
discrete matrix M to this operator has been recently rigorously proven in [20|. However, 
the important issue of boundary conditions was not considered. 

Since Mil is defined in the bounded domain fi, the eigenvalues and eigenfunctions of Hb 
depend on the boundary conditions imposed on dil. As shown in [8 1, in the limit e — > 0, 



Table 1 : Random Walks and Diffusion Processes 



Case 


Operator 


Stochastic Process 


£ > 

n < oo 


finite n x n 
matrix M 


R.W. discrete in space 
discrete in time 


e > 
n — * oo 


operators 

Tf,T b 


R.W. continuous in space 
discrete in time 


e^O 
n = oo 


infinitesimal 
generator H j 


diffusion process 
continuous in time & space 



the random walk satisfies reflecting boundary conditions on d£l, which translate into 

dtp (x 



= (19) 

an 



dn 

where n is a unit normal vector at the point x e dil. 

To conclude, the left and right eigenvectors of the finite matrix M can be viewed as discrete 
approximations to those of the operators Tf and Tb, which in turn can be viewed as approx- 
imations to those of Hf and Hb- Therefore, if there are enough data points for accurate 
statistical sampling, the structure and characteristics of the eigenvalues and eigenfunctions 
of Hb are similar to the corresponding eigenvalues and discrete eigenvectors of M. For 
convenience, the three different stochastic processes are shown in table 1 . 



4 Fokker-Planck eigenfunctions and spectral clustering 

According to dl6> , if X £ is an eigenvalue of the matrix M or of the integral operator Tb based 
on a kernel with parameter e, then the corresponding eigenvalue of Hb is fx ~ (A e — l)/e. 
Therefore the largest eigenvalues of M correspond to the smallest eigenvalues of Hb- These 
eigenvalues and their corresponding eigenfunctions have been extensively studied in the 
literature under various settings. In general, the eigenvalues and eigenfunctions depend 
both on the geometry of the domain fl and on the profile of the potential U (x). For clarity 
and due to lack of space we briefly analyze here two extreme cases. In the first case ft = W 
so geometry plays no role, while in the second U (x) = const so density plays no role. 
Yet we show that in both cases there can still be well defined clusters, with the unifying 
probabilistic concept being that the mean exit time from one cluster to another is much 
larger than the characteristic equilibration time inside each cluster. 

Case I: Consider diffusion in a smooth potential field U(x) in ft = M p , where U has a few 
local minima, and U(x) — > oo as ||cc|| — » oo fast enough so that J e dx = 1 < oo. Each 
such local minimum thus defines a metastable state, with transitions between metastable 
states being relatively rare events, depending on the barrier heights separating them. As 
shown in 1181 \19 1 (and in many other works) there is an intimate connection between the 
smallest eigenvalues of Hb and mean exit times out of these metastable states. Specifically, 
in the asymptotic limit of small noise D -C 1, exit times are exponentially distributed and 
the first non-trivial eigenvalue (after /io = 0) is given by /ii = 1/f where f is the mean exit 
time to overcome the highest potential barrier on the way to the deepest potential well. For 
the case of two potential wells, for example, the corresponding eigenfunction is roughly 
constant in each well with a sharp transition near the saddle point between the wells. In 
general, in the case of k local minima there are asymptotically only k eigenvalues very close 
to zero. Apart from /io = 0, each of the other k — 1 eigenvalues corresponds to the mean 
exit time from one of the wells into the deepest one, with the corresponding eigenfunctions 
being almost constant in each well. Therefore, for a finite dataset the presence of only k 
eigenvalues close to 1 with a spectral gap, e.g. a large difference between and Xk+i 



is indicative of k well defined global clusters. In figure 1 (left) an example of this case is 
shown, where p(x) is the sum of two well separated Gaussian clouds leading to a double 
well potential. Indeed there are only two eigenvalues close or equal to 1 with a distinct 
spectral gap and the first eigenfunction being almost piecewise constant in each well. 

In stochastic dynamical systems a spectral gap corresponds to a separation of time scales 
between long transition times from one well or metastable state to another as compared 
to short equilibration times inside each well. Therefore, clustering and identification of 
metastable states are very similar tasks, and not surprisingly algorithms similar to the nor- 
malized graph Laplacian have been independently developed in the literature 1 21 1. 

The above mentioned results are asymptotic in the small noise limit. In practical datasets, 
there can be clusters of different scales, where a global analysis with a single e is not suit- 
able. As an example consider the second dataset in figure 1, with three clusters. While 
the first eigenvector distinguishes between the large cluster and the two smaller ones, the 
second eigenvector captures the equilibration inside the large cluster instead of further dis- 
tinguishing the two small clusters. While a theoretical explanation is beyond the scope of 
this paper, a possible solution is to choose a location dependent e, as proposed in \ 22\. 

Case II: Consider a uniform density in a region C t 3 composed of two large con- 
tainers connected by a narrow circular tube, as in the top right frame in figure 1 . In this 
case U(x) = const, so the second term in illi vanishes. As shown in |23|, the second 
eigenvalue of the FP operator is extremely small, of the order of a/V where a is the radius 
of the connecting tube and V is the volume of the containers, thus showing an interesting 
connection to the Cheeger constant on graphs. The corresponding eigenfunction is almost 
piecewise constant in each container with a sharp transition in the connecting tube. Even 
though in this case the density is uniform, there still is a spectral gap with two well defined 
clusters (the two containers), defined entirely by the geometry of il. An example of such a 
case and the results of the diffusion map are shown in figure 1 (right). 

In summary the eigenfunctions and eigenvalues of the FP operator, and thus of the corre- 
sponding finite Markov matrix, depend on both geometry and density. The diffusion dis- 
tance and its close relation to mean exit times between different clusters is the quantity that 
incorporates these two features. This provides novel insight into spectral clustering algo- 
rithms, as well as a theoretical justification for the algorithm in 1121 . which defines clusters 
according to mean travel times between points on the graph. Finally, for dynamical systems 
these eigenvectors can be used to design better search and data collection protocols |24|. 
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Figure 1: Diffusion map results on different datasets. Top - the datasets. Middle - the 
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